You are working as a data analyst for a yacht and boat sales website. The marketing team is preparing a weekly newsletter for boat owners. The newsletter is designed to help sellers to get more views of their boat, as well as stay on top of market trends. They would like you to take a look at the recent data and help them learn more about the characteristics of the most viewed boat listings in the last 7 days - is it the most expensive boats that get the most views? Are there common features among the most viewed boats?
library(data.table)
library(skimr)
library(tidyverse)
library(plotly)
library(priceR)
source("Functions.R")
dt <- fread("https://s3.amazonaws.com/talent-assets.datacamp.com/boat_data.csv")
skim(dt)
| Name | dt |
| Number of rows | 9888 |
| Number of columns | 10 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Price | 0 | 1 | 7 | 12 | 0 | 3182 | 0 |
| Boat Type | 0 | 1 | 3 | 43 | 0 | 126 | 0 |
| Manufacturer | 0 | 1 | 0 | 42 | 1338 | 911 | 0 |
| Type | 0 | 1 | 0 | 28 | 6 | 25 | 0 |
| Material | 0 | 1 | 0 | 19 | 1749 | 12 | 0 |
| Location | 0 | 1 | 0 | 84 | 36 | 2841 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year Built | 0 | 1.00 | 1893.19 | 460.20 | 0.00 | 1996.00 | 2007.00 | 2017.00 | 2021.00 | ▁▁▁▁▇ |
| Length | 9 | 1.00 | 11.57 | 6.00 | 1.04 | 7.47 | 10.28 | 13.93 | 100.00 | ▇▁▁▁▁ |
| Width | 56 | 0.99 | 3.52 | 1.22 | 0.01 | 2.54 | 3.33 | 4.25 | 25.16 | ▇▁▁▁▁ |
| Number of views last 7 days | 0 | 1.00 | 149.16 | 151.82 | 13.00 | 70.00 | 108.00 | 172.00 | 3263.00 | ▇▁▁▁▁ |
Few points to note :
dt$view_pct <- 100*dt$`Number of views last 7 days`/sum(dt$`Number of views last 7 days`)
dt <- dt %>%
separate(Price,c("Currency","Amount"),sep=" ",remove=TRUE,convert=TRUE)
dt <- dt %>%
separate(Type,c("Condition","Fuel"),sep=",",remove=TRUE,convert=TRUE)
dt[Condition=="",Condition:=NA]
dt[Condition=="Diesel" & is.na(Fuel),Fuel:="Diesel"]
dt[Condition=="Diesel" & Fuel=="Diesel",Condition:=NA]
dt[Condition=="Unleaded" & is.na(Fuel),Fuel:="Unleaded"]
dt[Condition=="Unleaded" & Fuel=="Unleaded",Condition:=NA]
dt[Condition=="Electric" & is.na(Fuel),Fuel:="Electric"]
dt[Condition=="Electric" & Fuel=="Electric",Condition:=NA]
dt[,Condition := as.factor(Condition)]
dt[,Fuel := as.factor(Fuel)]
dt[,Material := as.factor(Material)]
Now lets convert the Amount column to a standard currency amount, CAD in this case (based on the currency in “Currency” column and the average exchange rate of that currency to CAD from 1st Jan of this year till yesterday)
conversions <- sapply(unique(dt$Currency),avg_ex) %>%
data.frame() %>%
rownames_to_column() %>%
`colnames<-`(c('Currency','conv'))
dt <- dt %>%
left_join(conversions,by='Currency') %>%
mutate(Amount_CAD = Amount*conv) %>%
select(-c(14))
dt[,Currency := as.factor(Currency)]
Lets skim through our data once more -
skim(dt)
| Name | dt |
| Number of rows | 9888 |
| Number of columns | 14 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 4 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Boat Type | 0 | 1 | 3 | 43 | 0 | 126 | 0 |
| Manufacturer | 0 | 1 | 0 | 42 | 1338 | 911 | 0 |
| Location | 0 | 1 | 0 | 84 | 36 | 2841 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Currency | 0 | 1.00 | FALSE | 4 | EUR: 8430, CHF: 980, £: 298, DKK: 180 |
| Condition | 86 | 0.99 | FALSE | 4 | Use: 7327, new: 2084, new: 272, Dis: 119 |
| Fuel | 2212 | 0.78 | FALSE | 6 | Die: 4568, Unl: 3040, Ele: 52, Gas: 13 |
| Material | 0 | 1.00 | FALSE | 12 | GRP: 5484, emp: 1749, PVC: 1123, Ste: 939 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Amount | 0 | 1.00 | 320137.34 | 1007482.24 | 3300.00 | 44000.00 | 95000.00 | 255000.00 | 31000000.00 | ▇▁▁▁▁ |
| Year Built | 0 | 1.00 | 1893.19 | 460.20 | 0.00 | 1996.00 | 2007.00 | 2017.00 | 2021.00 | ▁▁▁▁▇ |
| Length | 9 | 1.00 | 11.57 | 6.00 | 1.04 | 7.47 | 10.28 | 13.93 | 100.00 | ▇▁▁▁▁ |
| Width | 56 | 0.99 | 3.52 | 1.22 | 0.01 | 2.54 | 3.33 | 4.25 | 25.16 | ▇▁▁▁▁ |
| Number of views last 7 days | 0 | 1.00 | 149.16 | 151.82 | 13.00 | 70.00 | 108.00 | 172.00 | 3263.00 | ▇▁▁▁▁ |
| view_pct | 0 | 1.00 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.22 | ▇▁▁▁▁ |
| Amount_CAD | 0 | 1.00 | 444452.15 | 1427370.31 | 4571.60 | 63093.68 | 133971.92 | 364701.33 | 46145883.03 | ▇▁▁▁▁ |
Lets look at the log distribution of the “Number of views last 7 days” metric.
hist(log(dt$`Number of views last 7 days`))
abline(v=mean(log(dt$`Number of views last 7 days`)),col="red",lwd=3)
Above is a histogram of its log distribution. The red vertical line represents the mean of the distribution. The log mean is 4.73 which corresponds to 113.3. What is means is that on average, each listing gets about 113.3 views in 7 days.
Now lets look at the quantiles of the same, to get more idea about the distribution of views.
quantile(dt$`Number of views last 7 days`,c(.7,.8,.9,.95,.99))
## 70% 80% 90% 95% 99%
## 154.00 196.00 283.00 389.00 773.65
As we can see, the top 10% listings in terms of number of views are the ones that have over 283 views in the last 7 days. The top 5% listings get over 389 views in 7 days. This shows that the data is heavily skewed towards lower values of “Number of views last 7 days”, with only a few listings getting much higher views.
Lets look at the proportion of views per boat type, top 10 boat types in terms of views -
dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(`Boat Type`)][order(-views)][1:10]
## Boat Type views views_prop
## 1: Motor Yacht 351659 23.842872
## 2: Sport Boat 251418 17.046421
## 3: Cabin Boat 131700 8.929407
## 4: Flybridge 127534 8.646947
## 5: Trawler 75456 5.116001
## 6: Pilothouse 74730 5.066777
## 7: Hardtop 65343 4.430328
## 8: Bowrider 47183 3.199060
## 9: Center console boat 37015 2.509658
## 10: Fishing Boat 32434 2.199061
pie_plot(dt$`Boat Type`)
We can see that “Motor Yacht” boats are getting almost 24% of the total views in the last 7 days, followed by “Sport Boat” at 17% and “Cabin boat” & “Flybridge” at about 8-9% each.
Motor Yachts seem to be the most popular type of boats.
Lets look at the proportion of views per Manufacturer, top 10 in terms of views -
dt[Manufacturer!="", .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(Manufacturer)][order(-views)][1:10]
## Manufacturer views views_prop
## 1: Bénéteau power boats 56211 3.811168
## 2: Sunseeker power boats 51110 3.465315
## 3: Jeanneau power boats 51070 3.462603
## 4: Sea Ray power boats 45387 3.077289
## 5: Princess power boats 39051 2.647701
## 6: Bavaria power boats 35569 2.411618
## 7: Bayliner power boats 34555 2.342868
## 8: Cranchi power boats 32442 2.199604
## 9: Azimut power boats 24339 1.650211
## 10: Fairline power boats 24276 1.645940
pie_plot(dt$Manufacturer)
Here we see that for a lot of listings, which constitue about 11% of the total views, the Manufacturer is not mentioned. Nonetheless, the most popular manufacturers are -
dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(Material)][order(-views)][1:10]
## Material views views_prop
## 1: GRP 899339 60.9761869
## 2: 211346 14.3294944
## 3: Steel 125036 8.4775802
## 4: PVC 95968 6.5067374
## 5: Wood 52296 3.5457271
## 6: Plastic 39678 2.6902126
## 7: Aluminium 38816 2.6317681
## 8: Carbon Fiber 7953 0.5392223
## 9: Thermoplastic 3700 0.2508641
## 10: Hypalon 538 0.0364770
pie_plot(dt$Material)
As seen above, GRP is the most popular material, with 61% of all the listings views being those where the material is GRP
dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(Location)][order(-views)][1:10]
## Location views views_prop
## 1: Netherlands » In verkoophaven 35369 2.3980576
## 2: Croatia (Hrvatska) 21476 1.4560967
## 3: Italy » Toscana » Toscana 16229 1.1003443
## 4: France 16096 1.0913267
## 5: Italy 13877 0.9408761
## 6: Italy » Mar Tirreno 11568 0.7843233
## 7: Netherlands 11546 0.7828317
## 8: Spain » Mallorca » Mallorca 11037 0.7483209
## 9: Switzerland » Vierwaldstättersee » Horw 10518 0.7131321
## 10: Italy » Campania » Campania 10454 0.7087929
pie_plot(dt$Location)
Boats from Netherlands location are getting most views.
dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(`Year Built`)][order(-views)][1:10]
## Year Built views views_prop
## 1: 2020 175872 11.924318
## 2: 2019 88438 5.996195
## 3: 0 72302 4.902156
## 4: 2006 52614 3.567288
## 5: 2008 51351 3.481655
## 6: 2007 48392 3.281032
## 7: 2017 42420 2.876123
## 8: 2018 42030 2.849681
## 9: 2005 41354 2.803847
## 10: 2004 37143 2.518337
pie_plot(dt$`Year Built`)
Boats from the year 2020 get the most views, about 12% of all views in the last 7 days were of boats built in 2020.
dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(Condition)][order(-views)][1:10]
## Condition views views_prop
## 1: Used boat 1136819 77.077596
## 2: new boat from stock 262363 17.788504
## 3: new boat on order 40373 2.737334
## 4: Display Model 23541 1.596106
## 5: <NA> 11806 0.800460
## 6: <NA> NA NA
## 7: <NA> NA NA
## 8: <NA> NA NA
## 9: <NA> NA NA
## 10: <NA> NA NA
pie_plot(dt$Condition)
Used boats are most likely to be viewed, they make up for 77% of all the listing views.
dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(Fuel)][order(-views)][1:10]
## Fuel views views_prop
## 1: Diesel 594954 40.33854453
## 2: Unleaded 579908 39.31840895
## 3: <NA> 288466 19.55831642
## 4: Electric 8595 0.58275058
## 5: Gas 2265 0.15356953
## 6: Hybrid 490 0.03322255
## 7: Propane 224 0.01518745
## 8: <NA> NA NA
## 9: <NA> NA NA
## 10: <NA> NA NA
pie_plot(dt$Fuel)
Diesel and Unleaded fuel types make up for most of the viewed boats, at about 80% together.
dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(Currency)][order(-views)]
## Currency views views_prop
## 1: EUR 1075324 72.908166
## 2: CHF 347993 23.594313
## 3: £ 28208 1.912534
## 4: DKK 23377 1.584987
pie_plot(dt$Currency)
EUR and CHF currencies are most looked at, with the other 2 being minority.
dtx <- log(dt$Amount_CAD)
dty <- dt$`Number of views last 7 days`
h <- hist(log(dt$Amount_CAD))
breaks <- data.frame("beg"=h$breaks[-length(h$breaks)],"end"=h$breaks[-1])
sums <- apply(breaks, MARGIN=1, FUN=function(x) { sum(dty[dtx >= x[1] & dtx < x[2] ]) })
h$counts <- sums
mean <- mean(log(dt$Amount_CAD))
median <- median(log(dt$Amount_CAD))
plot(h, ylab="Total Views (L7D)", main="Sum of L7D views Within Bins")
abline(v=mean,col="red",lwd=3)
abline(v=median,col="blue",lwd=3)
Most views are ones where the price is close to the mean of the distribution. Thus, no - the boats that are more expensive do not get more views. Most views are at a price range of around CAD 154218.
After going through the data, the following conclusions can be made -